A Review and Evaluation of Elastic Distance Functions for Time Series Clustering
Time series clustering is the act of grouping time series data without
recourse to a label. Algorithms that cluster time series can be classified into
two groups: those that employ a time series specific distance measure; and
those that derive features from time series. Both approaches usually rely on
traditional clustering algorithms such as k-means. Our focus is on
distance-based clustering algorithms that employ elastic distance measures, i.e. distances that
perform some kind of realignment whilst measuring distance. We describe nine
commonly used elastic distance measures and compare their performance with
k-means and k-medoids clustering. Our findings are surprising. The most popular
technique, dynamic time warping (DTW), performs worse than Euclidean distance
with k-means, and even when tuned, is no better. Using k-medoids rather than
k-means improved the clusterings for all nine distance measures. DTW is not
significantly better than Euclidean distance with k-medoids. Generally,
distance measures that employ editing in conjunction with warping perform
better, and one distance measure, the move-split-merge (MSM) method, is the
best performing measure of this study. We also compare to clustering with DTW
using barycentre averaging (DBA). We find that DBA does improve DTW k-means,
but that the standard DBA is still worse than using MSM. Our conclusion is to
recommend MSM with k-medoids as the benchmark algorithm for clustering time
series with elastic distance measures. We provide implementations in the aeon
toolkit, results and guidance on reproducing results on the associated GitHub
repository
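To illustrate what "realignment whilst measuring distance" means, the following is a minimal sketch and not the aeon implementation. It assumes a full warping window and a squared pointwise cost for DTW; the function names `euclidean`, `dtw` and `medoid` are hypothetical. The `medoid` helper shows why k-medoids pairs naturally with elastic distances: a cluster centre is chosen from the data using pairwise distances only, so no averaging of series is required.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series (no realignment)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """DTW distance with a full warping window (illustrative assumption).

    cost[i][j] holds the cheapest cumulative squared cost of aligning
    the first i points of a with the first j points of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three permitted warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def medoid(series, dist):
    """Index of the series with the smallest total distance to all others."""
    totals = [sum(dist(s, t) for t in series) for s in series]
    return totals.index(min(totals))


a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]  # the same shape as a, shifted one step earlier
print(euclidean(a, b), dtw(a, b))
```

For the shifted pair above, Euclidean distance penalises the offset whereas DTW warps it away. Any of the elastic measures compared in the study could be passed to `medoid` as `dist` in place of `dtw`.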
A tale of two toolkits, report the third: on the usage and performance of HIVE-COTE v1.0
The Hierarchical Vote Collective of Transformation-based Ensembles
(HIVE-COTE) is a heterogeneous meta ensemble for time series classification.
Since it was first proposed in 2016, the algorithm has undergone some minor
changes and there is now a configurable, scalable and easy to use version
available in two open source repositories. We present an overview of the latest
stable HIVE-COTE, version 1.0, and describe how it differs from the original. We
provide a walkthrough guide of how to use the classifier, and conduct extensive
experimental evaluation of its predictive performance and resource usage. We
compare the performance of HIVE-COTE to three recently proposed algorithms
The Canonical Interval Forest (CIF) Classifier for Time Series Classification
Time series classification (TSC) is home to a number of algorithm groups that
utilise different kinds of discriminatory patterns. One of these groups
describes classifiers that predict using phase dependent intervals. The time
series forest (TSF) classifier is one of the most well known interval methods,
and has demonstrated strong performance as well as relative speed in training
and predictions. However, recent advances in other approaches have left TSF
behind. TSF originally summarises intervals using three simple summary
statistics. The 'catch22' feature set of 22 time series features was recently
proposed to aid time series analysis through a concise set of diverse and
informative descriptive characteristics. We propose combining TSF and catch22
to form a new classifier, the Canonical Interval Forest (CIF). We outline
additional enhancements to the training procedure, and extend the classifier to
include multivariate classification capabilities. We demonstrate a large and
significant improvement in accuracy over both TSF and catch22, and show it to
be on par with top performers from other algorithmic classes. By upgrading the
interval-based component from TSF to CIF, we also demonstrate a significant
improvement in the hierarchical vote collective of transformation-based
ensembles (HIVE-COTE) that combines different time series representations.
HIVE-COTE using CIF is significantly more accurate on the UCR archive than any
other classifier we are aware of and represents a new state of the art for TSC
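The interval summarisation step can be sketched as follows. This is a minimal illustration, not the CIF or TSF implementation: it assumes the three summary statistics are the mean, standard deviation and slope (the abstract does not name them), and the function names, the `min_len` parameter and the seeding scheme are hypothetical. CIF would additionally draw on the catch22 features for each interval.

```python
import random
import statistics


def slope(values):
    """Least-squares slope of the values against their index 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def interval_features(series, n_intervals=3, min_len=3, seed=0):
    """Summarise randomly placed intervals of one series.

    The same seed yields the same interval positions for every series of
    equal length, which is what makes the features phase dependent.
    Requires len(series) >= min_len >= 2.
    """
    rng = random.Random(seed)
    feats = []
    for _ in range(n_intervals):
        length = rng.randint(min_len, len(series))
        start = rng.randint(0, len(series) - length)
        window = series[start:start + length]
        feats.extend(
            [statistics.mean(window), statistics.pstdev(window), slope(window)]
        )
    return feats


print(interval_features([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], n_intervals=2))
```

The resulting feature vectors would then be passed to a tree learner, with a fresh set of random intervals drawn for each tree in the forest.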
Unsupervised Feature Based Algorithms for Time Series Extrinsic Regression
Time Series Extrinsic Regression (TSER) involves using a set of training time
series to form a predictive model of a continuous response variable that is not
directly related to the regressor series. The TSER archive for comparing
algorithms was released in 2022 with 19 problems. We increase the size of this
archive to 63 problems and reproduce the previous comparison of baseline
algorithms. We then extend the comparison to include a wider range of standard
regressors and the latest versions of TSER models used in the previous study.
We show that none of the previously evaluated regressors can outperform a
regression adaptation of a standard classifier, rotation forest. We introduce
two new TSER algorithms developed from related work in time series
classification. FreshPRINCE is a pipeline estimator consisting of a transform
into a wide range of summary features followed by a rotation forest regressor.
DrCIF is a tree ensemble that creates features from summary statistics over
random intervals. Our study demonstrates that both algorithms, along with
InceptionTime, exhibit significantly better performance compared to the other
18 regressors tested. More importantly, these two proposed models (DrCIF and
FreshPRINCE) are the only ones that significantly outperform the
standard rotation forest regressor. Comment: 19 pages, 21 figures, 6 tables. Appendix include
The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances
Time Series Classification (TSC) involves building predictive models for a discrete target variable from ordered, real valued, attributes. Over recent years, a new set of TSC algorithms have been developed which have made significant improvement over the previous state of the art. The main focus has been on univariate TSC, i.e. the problem where each case has a single series and a class label. In reality, it is more common to encounter multivariate TSC (MTSC) problems where the time series for a single case has multiple dimensions. Despite this, much less consideration has been given to MTSC than the univariate case. The UCR archive has provided a valuable resource for univariate TSC, and the lack of a standard set of test problems may explain why there has been less focus on MTSC. The UEA archive of 30 MTSC problems released in 2018 has made comparison of algorithms easier. We review recently proposed bespoke MTSC algorithms based on deep learning, shapelets and bag of words approaches. If an algorithm cannot naturally handle multivariate data, the simplest approach to adapt a univariate classifier to MTSC is to ensemble it over the multivariate dimensions. We compare the bespoke algorithms to these dimension independent approaches on the 26 of the 30 MTSC archive problems where the data are all of equal length. We demonstrate that four classifiers are significantly more accurate than the benchmark dynamic time warping algorithm and that one of these recently proposed classifiers, ROCKET, achieves significant improvement on the archive datasets in at least an order of magnitude less time than the other three
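The dimension independent approach can be sketched as below. This is an illustrative sketch under stated assumptions, not the code used in the evaluation: the base classifier is a stand-in 1-nearest-neighbour with Euclidean distance, predictions are combined by majority vote (the abstract does not specify the combination rule), and the class names `OneNN` and `DimensionEnsemble` are hypothetical.

```python
import math
from collections import Counter


class OneNN:
    """Stand-in univariate classifier: 1-nearest neighbour, Euclidean distance."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        return [
            self.y[min(range(len(self.X)), key=lambda i: math.dist(self.X[i], x))]
            for x in X
        ]


class DimensionEnsemble:
    """Fit one univariate classifier per dimension and combine by majority vote."""

    def __init__(self, make_classifier):
        self.make_classifier = make_classifier

    def fit(self, X, y):
        # X[case][dimension] is a single univariate series.
        n_dims = len(X[0])
        self.members = [
            self.make_classifier().fit([case[d] for case in X], y)
            for d in range(n_dims)
        ]
        return self

    def predict(self, X):
        per_dim = [
            member.predict([case[d] for case in X])
            for d, member in enumerate(self.members)
        ]
        # Ties go to the label voted for by the lowest-numbered dimension.
        return [Counter(votes).most_common(1)[0][0] for votes in zip(*per_dim)]


X_train = [
    [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
    [[5, 5, 5], [6, 6, 6], [5, 5, 5]],
]
y_train = ["a", "b"]
model = DimensionEnsemble(OneNN).fit(X_train, y_train)
print(model.predict([[[0, 1, 0], [1, 1, 2], [0, 0, 1]]]))
```

Because each member sees only one dimension, this design cannot exploit interactions between dimensions, which is what the bespoke MTSC algorithms under review are intended to capture.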
HIVE-COTE 2.0: a new meta ensemble for time series classification
The Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) is a heterogeneous meta ensemble for time series classification. HIVE-COTE forms its ensemble from classifiers of multiple domains, including phase-independent shapelets, bag-of-words based dictionaries and phase-dependent intervals. Since it was first proposed in 2016, the algorithm has remained state of the art for accuracy on the UCR time series classification archive. Over time it has been incrementally updated, culminating in its current state, HIVE-COTE 1.0. During this time a number of algorithms have been proposed which match the accuracy of HIVE-COTE. We propose comprehensive changes to the HIVE-COTE algorithm which significantly improve its accuracy and usability, presenting this upgrade as HIVE-COTE 2.0. We introduce two novel classifiers, the Temporal Dictionary Ensemble and Diverse Representation Canonical Interval Forest, which replace existing ensemble members. Additionally, we introduce the Arsenal, an ensemble of ROCKET classifiers as a new HIVE-COTE 2.0 constituent. We demonstrate that HIVE-COTE 2.0 is significantly more accurate on average than the current state of the art on 112 univariate UCR archive datasets and 26 multivariate UEA archive datasets
Identification of novel risk loci, causal insights, and heritable risk for Parkinson's disease: a meta-analysis of genome-wide association studies
Background Genome-wide association studies (GWAS) in Parkinson's disease have increased the scope of biological knowledge about the disease over the past decade. We aimed to use the largest aggregate of GWAS data to identify novel risk loci and gain further insight into the causes of Parkinson's disease. Methods We did a meta-analysis of 17 datasets from Parkinson's disease GWAS available from European ancestry samples to nominate novel loci for disease risk. These datasets incorporated all available data. We then used these data to estimate heritable risk and develop predictive models of this heritability. We also used large gene expression and methylation resources to examine possible functional consequences as well as tissue, cell type, and biological pathway enrichments for the identified risk factors. Additionally, we examined shared genetic risk between Parkinson's disease and other phenotypes of interest via genetic correlations followed by Mendelian randomisation. Findings Between Oct 1, 2017, and Aug 9, 2018, we analysed 7·8 million single nucleotide polymorphisms in 37 688 cases, 18 618 UK Biobank proxy-cases (ie, individuals who do not have Parkinson's disease but have a first degree relative that does), and 1·4 million controls. We identified 90 independent genome-wide significant risk signals across 78 genomic regions, including 38 novel independent risk signals in 37 loci. These 90 variants explained 16–36% of the heritable risk of Parkinson's disease depending on prevalence. Integrating methylation and expression data within a Mendelian randomisation framework identified putatively associated genes at 70 risk signals underlying GWAS loci for follow-up functional studies. Tissue-specific expression enrichment analyses suggested Parkinson's disease loci were heavily brain-enriched, with specific neuronal cell types being implicated from single cell data. 
We found significant genetic correlations with brain volumes (false discovery rate-adjusted p=0·0035 for intracranial volume, p=0·024 for putamen volume), smoking status (p=0·024), and educational attainment (p=0·038). Mendelian randomisation between cognitive performance and Parkinson's disease risk showed a robust association (p=8·00 × 10⁻⁷). Interpretation These data provide the most comprehensive survey of genetic risk within Parkinson's disease to date, to the best of our knowledge, by revealing many additional Parkinson's disease risk loci, providing a biological context for these risk factors, and showing that a considerable genetic component of this disease remains unidentified. These associations derived from European ancestry datasets will need to be followed up with more diverse data. Funding The National Institute on Aging at the National Institutes of Health (USA), The Michael J Fox Foundation, and The Parkinson's Foundation (see appendix for full list of funding sources)